PHOENIX-7845 ReplicationLogGroup initialization resilience to standby cluster unavailability #2466

Merged

tkhurana merged 3 commits into apache:PHOENIX-7562-feature-new from tkhurana:PHOENIX-7562-feature-new on May 11, 2026

Conversation

@tkhurana (Contributor) commented on May 7, 2026

Summary

  • Defer peer shard manager creation from ReplicationLogGroup.init() to individual mode consumers (SyncModeImpl, SyncAndForwardModeImpl, ReplicationLogDiscoveryForwarder)
  • When the standby namenode is unavailable at startup, the group gracefully degrades from SYNC to STORE_AND_FORWARD via the existing updateModeOnFailure() path (see the sketch after this list)
  • Change ReplicationLogGroup.get() to throw IOException instead of RuntimeException, so callers in IndexRegionObserver get proper error classification (not misclassified as IndexBuildingFailureException)
  • Forwarder lazily creates its peer shard manager with automatic retry on subsequent rounds
  • Replace createStandbyLog()/createFallbackLog() with single createReplicationLog(shardManager) factory
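
For readers unfamiliar with the mode machinery, the sketch below illustrates the deferred-creation pattern described above. It is not the actual Phoenix code: ShardManager, ReplicationLog, LogGroup, getOrCreatePeerShardManager(), and getLocalShardManager() are simplified stand-ins; only updateModeOnFailure() and createReplicationLog(shardManager) correspond to names used in this PR.

```java
import java.io.IOException;

// Simplified stand-ins for the real Phoenix types; these signatures are assumptions.
interface ShardManager { }

interface ReplicationLog {
  void append(byte[] record) throws IOException;
}

interface LogGroup {
  ShardManager getOrCreatePeerShardManager() throws IOException; // created lazily, may fail
  ShardManager getLocalShardManager();
  ReplicationLog createReplicationLog(ShardManager shardManager) throws IOException;
  void updateModeOnFailure(IOException cause); // SYNC -> STORE_AND_FORWARD
}

// Sketch of a SYNC-mode consumer: the peer shard manager is created only when
// first needed, so ReplicationLogGroup.init() can succeed while the standby
// cluster is down.
final class SyncModeSketch {
  private final LogGroup group;

  SyncModeSketch(LogGroup group) {
    this.group = group;
  }

  void append(byte[] record) throws IOException {
    try {
      ShardManager peer = group.getOrCreatePeerShardManager();
      group.createReplicationLog(peer).append(record);
    } catch (IOException e) {
      // Standby namenode unreachable: degrade to store-and-forward and write
      // to the local shard manager instead of failing the client write.
      group.updateModeOnFailure(e);
      group.createReplicationLog(group.getLocalShardManager()).append(record);
    }
  }
}
```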

Test plan

  • testInitDegradesToSafWhenPeerUnavailable — peer unavailable at startup → SAF mode, writes succeed
  • testInitFailsWhenLocalUnavailable — local FS unavailable → init fails with IOException
  • testForwarderRetriesPeerCreation — forwarder retries peer shard manager on next round after initial failure
  • All 41 existing ReplicationLogGroupTest tests pass
  • All 3 ReplicationLogDiscoveryForwarderTest tests pass
  • All 2 ReplicationLogTest tests pass
  • HABaseIT integration tests (running)

@tkhurana requested review from apurtell and kadirozde on May 7, 2026 at 22:16
…eption

- Replace RuntimeException wrapping in get() with UncheckedIOException to
  avoid misclassifying unrelated RuntimeExceptions with IOException causes
- Bound peer shard manager creation with configurable timeout (default 10s)
  via CompletableFuture.get() to prevent blocking the disruptor handler
  thread on peer NN outage; TimeoutException triggers SAF degradation
- Consolidate peer shard manager into a single lazy synchronized accessor
  on ReplicationLogGroup; remove per-component caching from forwarder
- Cancel the in-flight future on timeout to release the ForkJoinPool thread
- Add test for timeout-triggered SAF degradation
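
A rough sketch of the lazy, timeout-bounded accessor this commit describes; LazyPeerAccessSketch, openPeerShardManager(), and PEER_INIT_TIMEOUT_MS are placeholder names, and the exception handling and cancellation are refined by the inline review comments below.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative only: the field, constant, and openPeerShardManager() are placeholders.
final class LazyPeerAccessSketch {
  interface ShardManager { }

  private static final long PEER_INIT_TIMEOUT_MS = 10_000L; // configurable, default 10s
  private ShardManager peerShardManager; // guarded by "this"

  synchronized ShardManager getPeerShardManager() throws IOException {
    if (peerShardManager != null) {
      return peerShardManager;
    }
    CompletableFuture<ShardManager> future =
        CompletableFuture.supplyAsync(this::openPeerShardManager);
    try {
      // Bound the wait so a hung standby namenode cannot stall the disruptor
      // handler thread indefinitely; the IOException lets the caller degrade
      // to store-and-forward.
      peerShardManager = future.get(PEER_INIT_TIMEOUT_MS, TimeUnit.MILLISECONDS);
      return peerShardManager;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted creating peer shard manager", e);
    } catch (ExecutionException | TimeoutException e) {
      // The inline review comments below refine this handling.
      throw new IOException("Peer shard manager unavailable", e);
    }
  }

  private ShardManager openPeerShardManager() {
    // Would open the standby cluster's filesystem; can block during an NN outage.
    return new ShardManager() { };
  }
}
```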

@apurtell (Contributor) left a comment

+1 with some minor comments

try {
  peerShardManager = future.get(peerInitTimeoutMs, TimeUnit.MILLISECONDS);
  return peerShardManager;
} catch (UncheckedIOException e) {

Nit. future.get() won't throw UncheckedIOException. When the supplier throws unchecked, the future fails and Future.get() wraps the cause in ExecutionException. You can unwrap the UncheckedIOException from ExecutionException where you need it.
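
A self-contained illustration of the suggested unwrapping; getWithUnwrap() and peerInitTimeoutMs are placeholder names, not the PR's code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class UnwrapSketch {
  static <T> T getWithUnwrap(Future<T> future, long peerInitTimeoutMs)
      throws IOException, InterruptedException, TimeoutException {
    try {
      return future.get(peerInitTimeoutMs, TimeUnit.MILLISECONDS);
    } catch (ExecutionException e) {
      // A supplier that failed with UncheckedIOException surfaces here as the
      // cause; unwrap it to recover the original IOException for classification.
      Throwable cause = e.getCause();
      if (cause instanceof UncheckedIOException) {
        throw ((UncheckedIOException) cause).getCause();
      }
      throw new IOException("Failed to create peer shard manager", cause);
    }
  }
}
```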

  }
  throw new IOException("Failed to create peer shard manager", e.getCause());
} catch (TimeoutException e) {
  future.cancel(true);

future.cancel(true) does not actually free the worker thread. (Per the CompletableFuture Javadoc, mayInterruptIfRunning has no effect, i.e. interrupts are not used to control processing.) The stall is moved to the worker pool, which is better, but it still holds resources for as long as the underlying HDFS call is blocked. Every subsequent call to this method spawns another supplyAsync task while the previous one is still hung, so the problem compounds and can eventually exhaust the worker pool.

I asked the robot about this and it recommended: Use a dedicated ExecutorService (single-thread, bounded queue) so future.cancel(true) can actually interrupt, and so leakage is bounded.
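
A hypothetical sketch of that suggestion: a dedicated single-thread executor with a bounded queue, where Future.cancel(true) does interrupt the worker and a still-hung previous attempt causes new submissions to be rejected rather than piling up. All names here are illustrative, not the PR's code.

```java
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class DedicatedExecutorSketch {
  interface ShardManager { }

  private final ThreadPoolExecutor peerInitExecutor = new ThreadPoolExecutor(
      1, 1, 0L, TimeUnit.MILLISECONDS,
      new ArrayBlockingQueue<>(1)); // bounded: at most one queued attempt

  ShardManager createPeerShardManager(long timeoutMs) throws IOException {
    Future<ShardManager> future;
    try {
      future = peerInitExecutor.submit(this::openPeerShardManager);
    } catch (RejectedExecutionException e) {
      // A previous attempt is still hung; do not pile up more tasks.
      throw new IOException("Peer shard manager creation already in progress", e);
    }
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true); // interrupts the worker thread, unlike CompletableFuture
      throw new IOException("Timed out creating peer shard manager", e);
    } catch (ExecutionException e) {
      throw new IOException("Failed to create peer shard manager", e.getCause());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted creating peer shard manager", e);
    }
  }

  private ShardManager openPeerShardManager() {
    // Would open the standby cluster's filesystem; may block during an NN outage.
    return new ShardManager() { };
  }
}
```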

if (cached != null) {
  return cached;
}
synchronized (this) {

With both the disruptor handler and the forwarder thread involved, the mutual exclusion here may be held for longer than 10s.

- Remove dead catch(UncheckedIOException) block; future.get() wraps
  supplier exceptions in ExecutionException, not UncheckedIOException
- Unwrap UncheckedIOException inside ExecutionException handler to
  recover the original IOException
- Cache the in-flight CompletableFuture to prevent spawning unbounded
  async tasks when peer HDFS is unavailable; reuse the pending future
  on timeout, retry only after exceptional completion
- Remove no-op future.cancel(true) which has no effect on
  CompletableFuture (mayInterruptIfRunning is ignored)
@tkhurana merged commit dfc14db into apache:PHOENIX-7562-feature-new on May 11, 2026